{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While Ragas has [a number of built-in metrics](./../../../concepts/metrics/available_metrics/index.md), you may find yourself needing to create a custom metric for your use case. This guide will help you do just that. \n",
    "\n",
    "For the sake of this tutorial, let's assume we want to build a custom metric that measures the hallucinations in a LLM application. While we do have a built-in metric called [Faithfulness][ragas.metrics.Faithfulness] which is similar but not exactly the same. `Faithfulness` measures the factual consistency of the generated answer against the given context while `Hallucinations` measures the presence of hallucinations in the generated answer.\n",
    "\n",
    "before we start, lets load the dataset and define the llm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# dataset\n",
    "from datasets import load_dataset\n",
    "\n",
    "from ragas import EvaluationDataset\n",
    "\n",
    "amnesty_qa = load_dataset(\"vibrantlabsai/amnesty_qa\", \"english_v3\")\n",
    "eval_dataset = EvaluationDataset.from_hf_dataset(amnesty_qa[\"eval\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "EvaluationDataset(features=['user_input', 'retrieved_contexts', 'response', 'reference'], len=20)"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "eval_dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "--8<--\n",
    "choose_evaluator_llm.md\n",
    "--8<--"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from ragas.llms import llm_factory\n",
    "\n",
    "evaluator_llm = llm_factory(\"gpt-4o\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Aspect Critic - Simple Criteria Scoring\n",
    "\n",
    "[Aspect Critic](./../../../concepts/metrics/available_metrics/aspect_critic.md) that outputs a binary score for `definition` you provide. A simple pass/fail metric can be bring clarity and focus to what you are trying to measure and is a better alocation of effort than building a more complex metric from scratch, especially when starting out. \n",
    "\n",
    "Check out these resources to learn more about the effectiveness of having a simple pass/fail metric:\n",
    "\n",
    "- [Hamel's Blog on Creating LLM-as-a-Judge that drives Business Result](https://hamel.dev/blog/posts/llm-judge/#step-3-direct-the-domain-expert-to-make-passfail-judgments-with-critiques)\n",
    "- [Eugene's Blog on AlignEval](https://eugeneyan.com/writing/aligneval/#labeling-mode-look-at-the-data)\n",
    "\n",
    "Now let's create a simple pass/fail metric to measure the hallucinations in the dataset with Ragas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from ragas.metrics import AspectCritic\n",
    "\n",
    "# you can init the metric with the evaluator llm\n",
    "hallucinations_binary = AspectCritic(\n",
    "    name=\"hallucinations_binary\",\n",
    "    definition=\"Did the model hallucinate or add any information that was not present in the retrieved context?\",\n",
    "    llm=evaluator_llm,\n",
    ")\n",
    "\n",
    "await hallucinations_binary.single_turn_ascore(eval_dataset[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Domain Specific Metrics or Rubric based Metrics\n",
    "\n",
    "Here we will build a rubric based metric that evaluates the data on a scale of 1 to 5 based on the rubric we provide. You can read more about the rubric based metrics [here](./../../../concepts/metrics/available_metrics/rubrics_based.md)\n",
    "\n",
    "For our example of building a hallucination metric, we will use the following rubric:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "rubric = {\n",
    "    \"score1_description\": \"There is no hallucination in the response. All the information in the response is present in the retrieved context.\",\n",
    "    \"score2_description\": \"There are no factual statements that are not present in the retrieved context but the response is not fully accurate and lacks important details.\",\n",
    "    \"score3_description\": \"There are many factual statements that are not present in the retrieved context.\",\n",
    "    \"score4_description\": \"The response contains some factual errors and lacks important details.\",\n",
    "    \"score5_description\": \"The model adds new information and statements that contradict the retrieved context.\",\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now lets init the metric with the rubric and evaluator llm and evaluate the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from ragas.metrics import RubricsScore\n",
    "\n",
    "hallucinations_rubric = RubricsScore(\n",
    "    name=\"hallucinations_rubric\", llm=evaluator_llm, rubrics=rubric\n",
    ")\n",
    "\n",
    "await hallucinations_rubric.single_turn_ascore(eval_dataset[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Custom Metrics\n",
    "\n",
    "If your use case is not covered by those two, you can build a custom metric by subclassing the base `Metric` class in Ragas but before that you have to ask yourself the following questions:\n",
    "\n",
    "1. Am I trying to build a single turn or multi turn metric? If yes, subclassing the `Metric` class along with either [SingleTurnMetric][ragas.metrics.base.SingleTurnMetric] or [MultiTurnMetric][ragas.metrics.base.MultiTurnMetric] depending on if you are evaluating single turn or multi turn interactions.\n",
    "\n",
    "2. Do I need to use LLMs to evaluate my metric? If yes, instead of subclassing the [Metric][ragas.metrics.base.Metric] class, subclassing the [MetricWithLLM][ragas.metrics.base.MetricWithLLM] class.\n",
    "\n",
    "3. Do I need to use embeddings to evaluate my metric? If yes, instead of subclassing the [Metric][ragas.metrics.base.Metric] class, subclassing the [MetricWithEmbeddings][ragas.metrics.base.MetricWithEmbeddings] class.\n",
    "\n",
    "4. Do I need to use both LLM and Embeddings to evaluate my metric? If yes, subclass both the [MetricWithLLM][ragas.metrics.base.MetricWithLLM] and [MetricWithEmbeddings][ragas.metrics.base.MetricWithEmbeddings] classes.\n",
    "\n",
    "\n",
    "For our example, we need to to use LLMs to evaluate our metric so we will subclass the [MetricWithLLM][ragas.metrics.base.MetricWithLLM] class and we are working for only single turn interactions for now so we will subclass the [SingleTurnMetric][ragas.metrics.base.SingleTurnMetric] class. \n",
    "\n",
    "As for the implementation, we will use the [Faithfulness][ragas.metrics.Faithfulness] metric to evaluate our metric to measure the hallucinations with the formula \n",
    "\n",
    "$$\n",
    "\\text{Hallucinations} = 1 - \\text{Faithfulness}\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# we are going to create a dataclass that subclasses `MetricWithLLM` and `SingleTurnMetric`\n",
    "# import types\n",
    "import typing as t\n",
    "from dataclasses import dataclass, field\n",
    "\n",
    "from ragas.callbacks import Callbacks\n",
    "from ragas.dataset_schema import SingleTurnSample\n",
    "from ragas.metrics import Faithfulness\n",
    "\n",
    "# import the base classes\n",
    "from ragas.metrics.base import MetricType, MetricWithLLM, SingleTurnMetric\n",
    "\n",
    "\n",
    "@dataclass\n",
    "class HallucinationsMetric(MetricWithLLM, SingleTurnMetric):\n",
    "    # name of the metric\n",
    "    name: str = \"hallucinations_metric\"\n",
    "    # we need to define the required columns for the metric\n",
    "    _required_columns: t.Dict[MetricType, t.Set[str]] = field(\n",
    "        default_factory=lambda: {\n",
    "            MetricType.SINGLE_TURN: {\"user_input\", \"response\", \"retrieved_contexts\"}\n",
    "        }\n",
    "    )\n",
    "\n",
    "    def __post_init__(self):\n",
    "        # init the faithfulness metric\n",
    "        self.faithfulness_metric = Faithfulness(llm=self.llm)\n",
    "\n",
    "    async def _single_turn_ascore(\n",
    "        self, sample: SingleTurnSample, callbacks: Callbacks\n",
    "    ) -> float:\n",
    "        faithfulness_score = await self.faithfulness_metric.single_turn_ascore(\n",
    "            sample, callbacks\n",
    "        )\n",
    "        return 1 - faithfulness_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.5"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "hallucinations_metric = HallucinationsMetric(llm=evaluator_llm)\n",
    "\n",
    "await hallucinations_metric.single_turn_ascore(eval_dataset[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's evaluate the entire dataset with the metrics we have created."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from ragas import evaluate\n",
    "\n",
    "results = evaluate(\n",
    "    eval_dataset,\n",
    "    metrics=[hallucinations_metric, hallucinations_rubric, hallucinations_binary],\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'hallucinations_metric': 0.5932, 'hallucinations_rubric': 3.1500, 'hallucinations_binary': 0.1000}"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user_input</th>\n",
       "      <th>retrieved_contexts</th>\n",
       "      <th>response</th>\n",
       "      <th>reference</th>\n",
       "      <th>hallucinations_metric</th>\n",
       "      <th>hallucinations_rubric</th>\n",
       "      <th>hallucinations_binary</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>What are the global implications of the USA Su...</td>\n",
       "      <td>[- In 2022, the USA Supreme Court handed down ...</td>\n",
       "      <td>The global implications of the USA Supreme Cou...</td>\n",
       "      <td>The global implications of the USA Supreme Cou...</td>\n",
       "      <td>0.423077</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Which companies are the main contributors to G...</td>\n",
       "      <td>[In recent years, there has been increasing pr...</td>\n",
       "      <td>According to the Carbon Majors database, the m...</td>\n",
       "      <td>According to the Carbon Majors database, the m...</td>\n",
       "      <td>0.862069</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Which private companies in the Americas are th...</td>\n",
       "      <td>[The issue of greenhouse gas emissions has bec...</td>\n",
       "      <td>According to the Carbon Majors database, the l...</td>\n",
       "      <td>The largest private companies in the Americas ...</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>What action did Amnesty International urge its...</td>\n",
       "      <td>[In the case of the Ogoni 9, Amnesty Internati...</td>\n",
       "      <td>Amnesty International urged its supporters to ...</td>\n",
       "      <td>Amnesty International urged its supporters to ...</td>\n",
       "      <td>0.400000</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>What are the recommendations made by Amnesty I...</td>\n",
       "      <td>[In recent years, Amnesty International has fo...</td>\n",
       "      <td>Amnesty International made several recommendat...</td>\n",
       "      <td>The recommendations made by Amnesty Internatio...</td>\n",
       "      <td>0.952381</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                          user_input  \\\n",
       "0  What are the global implications of the USA Su...   \n",
       "1  Which companies are the main contributors to G...   \n",
       "2  Which private companies in the Americas are th...   \n",
       "3  What action did Amnesty International urge its...   \n",
       "4  What are the recommendations made by Amnesty I...   \n",
       "\n",
       "                                  retrieved_contexts  \\\n",
       "0  [- In 2022, the USA Supreme Court handed down ...   \n",
       "1  [In recent years, there has been increasing pr...   \n",
       "2  [The issue of greenhouse gas emissions has bec...   \n",
       "3  [In the case of the Ogoni 9, Amnesty Internati...   \n",
       "4  [In recent years, Amnesty International has fo...   \n",
       "\n",
       "                                            response  \\\n",
       "0  The global implications of the USA Supreme Cou...   \n",
       "1  According to the Carbon Majors database, the m...   \n",
       "2  According to the Carbon Majors database, the l...   \n",
       "3  Amnesty International urged its supporters to ...   \n",
       "4  Amnesty International made several recommendat...   \n",
       "\n",
       "                                           reference  hallucinations_metric  \\\n",
       "0  The global implications of the USA Supreme Cou...               0.423077   \n",
       "1  According to the Carbon Majors database, the m...               0.862069   \n",
       "2  The largest private companies in the Americas ...               1.000000   \n",
       "3  Amnesty International urged its supporters to ...               0.400000   \n",
       "4  The recommendations made by Amnesty Internatio...               0.952381   \n",
       "\n",
       "   hallucinations_rubric  hallucinations_binary  \n",
       "0                      3                      0  \n",
       "1                      3                      0  \n",
       "2                      3                      0  \n",
       "3                      3                      0  \n",
       "4                      3                      0  "
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results_df = results.to_pandas()\n",
    "results_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you want to learn more about how to build custom metrics, you can read the [Custom Metrics Advanced](./_write_your_own_metric_advanced.md) guide."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "ragas",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
